Antipattern: Clone Tables or Columns

In the television series Star Trek, “tribbles” are small furry animals kept as pets. Tribbles are very appealing at first, but soon they reveal their tendency to reproduce out of control, and managing the overpopulation of tribbles becomes a serious problem.

Where does one put them? Who’s responsible for them? How long would it take to pick up every tribble? Eventually, Captain Kirk discovers that his ship and crew can’t function, and he has to order his crew to make it their top priority to remove the tribbles.

We know from experience that querying a table with a few rows is quicker than querying a table with many rows, all other things being equal. This leads to a common fallacy: we must make every table contain fewer rows, no matter what we do. This leads to two forms of antipattern:

  • Splitting a single long table into multiple smaller tables and naming each table based on the distinct data values of one of the table’s attributes.

  • Splitting a single column into multiple columns, using column names based on distinct values as another attribute.

But we cannot get something for nothing; to meet the goal of having fewer rows in every table, we have to either create tables that have too many columns or else create a greater number of tables. In both cases, we find that the number of tables or columns keep growing as new data values make us create new schema objects.

Spawning tables#

To split data into separate tables, we need to define a policy about which rows belong in which tables. For example, we could split them up by the year in the date_reported column:

Creating individual Bugs tables for each year

Mixing Metadata with Data

As we insert rows into the database, it’s our responsibility to use the correct table, depending on the values we insert:

Inserting correct data in Bugs_2010 table

Fast forwarding to January 1 of the following year, it is easy to imagine our application starting to get an error from all new bug reports because we didn’t remember to create the Bugs_2011 table.

Inserting correct data in Bugs_2011 table

This means that introducing a new data value can cause a need for a new metadata object. This is not usually the relationship between data and metadata in SQL.

Managing data integrity#

Let’s suppose that our boss is trying to count the number of bugs reported during the year, but their numbers don’t add up. After investigating, we can see that a few bugs for the year 2010 were entered in the Bugs_2009 table by mistake. The following query should always return an empty result, but in this case it likely doesn’t:

Checking for invalid data in Bugs_2009 table

There’s no way to limit the data relative to the name of its table automatically, but we can declare a CHECK constraint in each of our tables:

Adding a check to restrict invalid data to be entered

We must take care to remember to adjust the value in the CHECK constraint when we create Bugs_2011. If we make a mistake, we could create a table that rejects the rows it’s supposed to accept.

Synchronizing data#

One day, a customer support analyst asks to change a bug report date. It’s in the database as reported on 2010-01-03, but the customer who reported it actually sent it in by fax a week earlier, on 2009-12-27. We could change the date with a simple UPDATE:

Trying to update a value that should be entered in another table

Let’s retrieve the data after using this statement.

Retrieving the data after updating a value which should be entered in another table

But this correction makes the row an invalid entry in the Bugs_2010 table. We would need to remove the row from one table and insert it into the other table in the unlikely case that a simple UPDATE would cause this anomaly.

Shifting the data to another table to which it belongs

Let’s see the working example of this query in the next playground.

Retrieving the data after shifting it to another table to which it belongs

Ensuring uniqueness#

We should make sure that the primary key values are unique across all the split tables. If we need to move a row from one table to another, we need some assurance that the primary key value doesn’t conflict with another row.

If we use a database that supports sequence objects, we can use a single sequence to generate values for all the split tables. For databases that support only per-table ID uniqueness, this may be more awkward. We have to define an extra table solely to produce primary key values:

Creating a table for generating primary key

Querying across tables#

Inevitably, we may need to make a query that references multiple tables. For example, our boss may ask for a count of all open bugs regardless of the year they were created. We can reconstruct the full set of bugs using a UNION of all the split tables and query that as a derived table:

Counting the number of bugs regardless of the year

As the years go on and we create more tables such asBugs_2011, we need to be especially careful to keep our application code up-to-date to reference the newly created tables.

Synchronizing metadata#

Let’s imagine now that our boss tells us to add a column to track the hours of work required to resolve each bug.

Adding a column in Bugs_2010 table

If we’ve split the table, then the new column applies only to the table that we alter. None of the other tables contains the new column.

If, instead, we use a UNION query across our split tables, as in the previous section, we stumble upon a new problem: we cannot combine tables using UNION if they don’t have the same columns. If they differ, then we have to name only the columns that all tables have in common, without using the wildcard operator (*).

Managing referential integrity#

If a dependent table like Comments references Bugs, the dependent table cannot declare a foreign key. A foreign key must specify a single table, but here the parent table is split into many.

Creating the Comments table and referring to different parent tables would be contrary

The split table may also have problems being a dependent instead of a parent. For example, Bugs.reported_by references the Accounts table. If we want to query all bugs reported by a given person regardless of the year, we can use a query like this:

Querying bugs reported by a given person

Identifying metadata tribbles columns#

Columns can be Metadata Tribbles, too. We can create a table containing columns that are bound to propagate by their nature, as we saw in the story at the beginning of this chapter.

Another example of a problem we may have in our bugs database is a table that records summary data for project metrics, where individual columns store subtotals. For instance, in the following table, it’s only a matter of time before we need to add the column bugs_fixed_2011:

Creating a table ProjectHistory for storing fixed bugs
Synopsis: Metadata Tribbles
Solution: Partition and Normalize
Mark as Completed
Report an Issue